AUTOMATIC CACHING GENERATION IN
NETWORK APPLICATIONS



FIELD OF THE INVENTION

[0001] Embodiments of the invention relate to network applications; and more
specifically, to automatic cache generation for network applications.

BACKGROUND OF THE INVENTION

[0002] Network processors (NP) are emerging as a core element of high-speed
communication routers and they are designed specifically for packet processing applications.
Such applications usually have stringent performance requirements. For instance, OC-192
(10 Gigabits/sec) POS (Packet over SONET) packet processing requires a throughput of 28
million packets per second or a service time of 4.57 microseconds per packet for transmission
and receipt in the worst case.

[0003] On the other hand, the latency for an external memory access in NPs is usually
larger than the worst-case service time. In order to address the unique challenge of packet
processing (e.g., maintaining stability while maximizing throughput and minimizing latency
for the worst-case traffic), modern network processors usually have a highly parallel
architecture. For instance, some network processors, such as the Intel IXA NPU family of
network processors (IXP), include multiple microengines (e.g., programmable processors
with packet processing capability) running in parallel, and each microengine supports
multiple hardware threads.

[0004] Consequently, the associated network applications are also highly parallel and
usually multi-threaded to compensate for the long memory access latency. Whenever a new
packet arrives, a series of tasks (e.g., receipt of the packet, routing table look-up, and
enqueueing) is performed on that packet by a new thread. In such a parallel programming
paradigm, modifications to global resources, such as a location in the shared memory, are
protected by critical sections to ensure mutual exclusiveness and synchronization between
threads.

[0005] Each critical section typically reads a resource, modifies it, and writes it back
(RMW). Figure 1 is a block diagram illustrating conventional external memory accesses
by multiple threads. As shown in Figure 1, if more than one thread is required to modify the
same critical data, a latency penalty will be incurred for each thread if each accesses the
external memory. Referring to Figure 1, each of the threads 101-104 has to be executed in
sequence. For example, thread 102 has to wait for thread 101 to finish the operations of read,
modification, and write back to the external memory before thread 102 can access the same
location of the external memory.



BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The invention may best be understood by referring to the following description
and accompanying drawings that are used to illustrate embodiments of the invention. In the
drawings:

[0007] Figure 1 is a block diagram illustrating a typical external memory access.

[0008] Figure 2 is a block diagram illustrating an example of an external memory access
using software controlled caching according to one embodiment.

[0009] Figure 3 is a block diagram illustrating an example of a caching mechanism
according to one embodiment.

[0010] Figure 4 is a block diagram illustrating an example of a caching mechanism
according to an alternative embodiment.

[0011] Figure 5 is a flow diagram illustrating an example of a process for software
automatic controlled caching, according to one embodiment.

[0012] Figures 6-8 are block diagrams of examples of pseudo code illustrating software
controlled caching operations according to one embodiment.

[0013] Figure 9 is a flow diagram illustrating an example of a process to identify a
candidate for caching according to one embodiment.

[0014] Figures 10-12 are block diagrams of examples of pseudo code illustrating
software controlled caching operations according to one embodiment.

[0015] Figure 13 is a block diagram illustrating an example of an external memory
access using software controlled caching according to one embodiment.

[0016] Figure 14 is a block diagram illustrating an example of memory allocations of
CAM and LM according to one embodiment.

[0017] Figure 15 is a flow diagram illustrating an example of a process for automatic
software controlled caching according to one embodiment.

[0018] Figure 16 is a flow diagram illustrating an example of a process for maintaining
images of a CAM and/or LM within a microengine, according to one embodiment.

[0019] Figure 17 is a block diagram of an example of pseudo code for the process
example of Figure 16.

[0020] Figure 18 is a block diagram of an example of a processor having multiple
microengines according to one embodiment.

[0021] Figure 19 is a block diagram illustrating an example of a data processing system
according to one embodiment.

DETAILED DESCRIPTION

[0022] Automatic software controlled caching generations in network applications are
described herein. In the following description, numerous specific details are set forth.
However, it is understood that embodiments of the invention may be practiced without these
specific details. In other instances, well-known circuits, structures and techniques have not
been shown in detail in order not to obscure the understanding of this description.
[0023] Some portions of the detailed descriptions which follow are presented in terms of
algorithms and symbolic representations of operations on data bits within a computer
memory. These algorithmic descriptions and representations are used by those skilled in the
data processing arts to most effectively convey the substance of their work to others skilled
in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of
operations leading to a desired result. The operations are those requiring physical
manipulations of physical quantities. Usually, though not necessarily, these quantities take
the form of electrical or magnetic signals capable of being stored, transferred, combined,
compared, and otherwise manipulated. It has proven convenient at times, principally for
reasons of common usage, to refer to these signals as bits, values, elements, symbols,
characters, terms, numbers, or the like.

[0024] It should be borne in mind, however, that all of these and similar terms are to be
associated with the appropriate physical quantities and are merely convenient labels applied
to these quantities. Unless specifically stated otherwise as apparent from the following
discussion, it is appreciated that throughout the description, discussions utilizing terms such
as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like,
refer to the action and processes of a computer system, or similar data processing device, that
manipulates and transforms data represented as physical (e.g., electronic) quantities within
the computer system's registers and memories into other data similarly represented as
physical quantities within the computer system memories or registers or other such
information storage, transmission or display devices.

[0025] Embodiments of the present invention also relate to apparatuses for performing
the operations described herein. An apparatus may be specially constructed for the required
purposes, or it may comprise a general purpose computer selectively activated or
reconfigured by a computer program stored in the computer. Such a computer program may
be stored in a computer readable storage medium, such as, but not limited to, any type of
disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only
memories (ROMs), random access memories (RAMs) such as Dynamic RAM (DRAM),
erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs
(EEPROMs), magnetic or optical cards, or any type of media suitable for storing electronic
instructions, and each of the above storage components is coupled to a computer system bus.
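[0026] The algorithms and displays presented herein are not inherently related to any
particular computer or other apparatus. Various general purpose systems may be used, or it
may prove convenient to construct a more specialized apparatus to perform the described
operations. A variety of programming languages may be used to implement the teachings
of the invention as described herein.

[0027] A machine-readable medium may include any mechanism for storing or transmitting
information in a form readable by a machine (e.g., a computer). For example, a
machine-readable medium includes read only memory (ROM); random access memory
(RAM); magnetic disk storage media; optical storage media; and electrical, optical,
acoustical or other forms of propagated signals (e.g., digital signals).

[0028] In one embodiment, software controlled caching can be used to help reduce the
latency penalty by folding multiple RMW operations into a single read and one or more
modifications, followed by a single write. As a result, the number of external memory
accesses is effectively minimized.

[0029] Figure 2 is a block diagram illustrating an example of external memory accesses
by multiple threads, according to one embodiment. The external memory access example
includes multiple threads that perform operations such as RMW operations. In one
embodiment, the RMW operations of the threads are merged so that the external memory is
read at the beginning and written back at the end, using a caching mechanism, which will be
described in detail further below. As a result, the number of external memory accesses is
significantly reduced, and the latency caused by the inter-thread dependence is
greatly reduced.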
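
The effect of folding several RMW operations into one read and one write can be illustrated with a small simulation. This is only a sketch and is not taken from the application: the names `ExternalMemory`, `rmw_per_thread` and `rmw_folded` are hypothetical, and the threads are modeled as a sequence of update functions applied one after another.

```python
class ExternalMemory:
    """Toy external memory that counts how often it is accessed."""

    def __init__(self):
        self.cells = {}
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.accesses += 1
        self.cells[addr] = value


def rmw_per_thread(mem, addr, updates):
    """Each thread performs its own read-modify-write on external memory."""
    for update in updates:
        value = mem.read(addr)
        mem.write(addr, update(value))


def rmw_folded(mem, addr, updates):
    """One read at the beginning, local modifications, one write at the end."""
    value = mem.read(addr)
    for update in updates:
        value = update(value)
    mem.write(addr, value)


increments = [lambda v: v + 1] * 4   # four threads, each adding 1

plain, folded = ExternalMemory(), ExternalMemory()
rmw_per_thread(plain, 0x100, increments)
rmw_folded(folded, 0x100, increments)
print(plain.accesses, folded.accesses)   # 8 accesses versus 2
```

Both variants leave the same final value in memory; only the number of external accesses differs.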






[0030] Figure 3 is a block diagram illustrating an example of a caching mechanism
according to one embodiment. In one embodiment, the caching mechanism 300 may be
implemented within a microengine of a processor having multiple microengines, such as, for
example, the Intel IXA NPU family of network processors (IXP).

[0031] For example, according to one embodiment, in each microengine of a processor
having multiple microengines, such as the Intel IXA NPU family of network processors (IXP), a
content addressable memory (CAM) unit and local memory (LM) may be combined to
implement the software controlled caching. Each of the entries in the CAM unit stores the
state and tag portion of a cache line, and its least recently used (LRU) logic maintains a
time-ordered list of CAM entry usage. The related cache operations (e.g., write back of the
data, etc.) are under software control.



[0032] In one embodiment, a processor includes, but is not limited to, multiple
microengines having a content addressable memory (CAM) and local memory respectively
to perform multiple threads substantially concurrently, each of the threads including one or
more instructions performing at least one external memory access based on a base address
that is substantially identical, where the base address is examined in the CAM to determine
whether the CAM includes an entry containing the base address, and where an entry of the local
memory corresponding to the entry of the CAM is accessed without having to access the
external memory, if the CAM includes the entry containing the base address.

[0033] Referring to Figure 3, in one embodiment, the caching mechanism example 300
includes, but is not limited to, a CAM 301 and a LM 302. Although the caching mechanism
example 300 includes one CAM and one LM, it will be appreciated that more than one
CAM and LM may be implemented. CAM and LM units are used as an example throughout
the present application. However, it will be appreciated that other types of
memories and mechanisms may also be implemented without departing from the broader scope and
spirit of embodiments of the invention.

[0034] In one embodiment, CAM 301 includes one or more entries 304-306, where each
of the entries 304-306 includes a tag field and a state field to store the tag portion and the
state portion of a cache line respectively. In addition, CAM 301 further includes a least
recently used (LRU) logic to determine the least recently used entry of CAM 301. In one
embodiment, LM 302 includes one or more entries 308-310, where each of the entries
308-310 is used to store a data portion of a cache line.

[0035] In one embodiment, when a request for accessing a location having a base address
of an external memory is received, the microengine that handles the thread of the request
may examine (e.g., walk through) the CAM 301 to locate an entry having the requested
base address. For example, each of the entries 304-306 of CAM 301 may be examined to
match the requested base address of the external memory. Typically, the entries 304-306 of
CAM 301 may store the base addresses of recently accessed external memory. If the entry
having the requested base address is found in CAM 301, LRU logic 303 returns a result 311
having state field 312, status field 313, and entry number field 314. The state field 312 may
contain the state of the corresponding cache line and/or the state of the CAM, and status field
313 indicates whether the cache access is a hit (result 311b) or a miss (result 311a). The entry
number field 314 contains the hit entry number of the CAM 301 that contains the requested
base address. The hit entry number may be used to access an entry of LM 302 according to
a predetermined algorithm, which will be described in detail further below, via index logic
307 of LM 302.

[0036] If it is determined that the CAM 301 does not contain the requested base address,
LRU logic 303 returns a least recently used entry of CAM 301 (e.g., result 311a). The least
recently used entry of the CAM is linked to an entry of LM 302 that may be used to store (e.g.,
cache) the data from the external memory access for subsequent external memory accesses
to the same base address.
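
The lookup behavior described above (a hit returns the matching entry number, a miss returns the least recently used entry) can be sketched in software. The class `Cam` and its methods are hypothetical names introduced for illustration; the sketch models only the tag and the LRU ordering, not the state field, and assumes a base address is never `None`.

```python
from collections import OrderedDict


class Cam:
    """Toy CAM: entry number -> tag (base address), kept in LRU order."""

    def __init__(self, num_entries):
        # First key is the least recently used entry; a tag of None is unused.
        self.entries = OrderedDict((n, None) for n in range(num_entries))

    def lookup(self, base):
        """Return (hit, entry_number); on a miss the LRU entry is returned."""
        hit = next((n for n, tag in self.entries.items() if tag == base), None)
        if hit is not None:
            self.entries.move_to_end(hit)      # now most recently used
            return True, hit
        return False, next(iter(self.entries))

    def write(self, number, base):
        """Store a new tag in an entry and mark it most recently used."""
        self.entries[number] = base
        self.entries.move_to_end(number)


cam = Cam(2)
first = cam.lookup(0x40)      # miss: LRU entry 0 is offered
cam.write(first[1], 0x40)
second = cam.lookup(0x40)     # hit on entry 0
third = cam.lookup(0x80)      # miss: entry 1 is now the LRU entry
print(first, second, third)
```

The returned entry number plays the role of the entry number field that indexes the local memory.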
[0037] The LRU logic 303 may be implemented as a part of CAM 301 and/or the index logic 307
may be implemented as a part of LM 302. However, the configurations are not limited to
the one shown in Figure 3. It will be appreciated that other configurations may exist. For
example, according to an alternative embodiment, the LRU logic 303 may be implemented
external to CAM 301 and/or the index logic 307 may be implemented external to LM 302, as in
caching mechanism example 400 shown in Figure 4. Further, the CAM 301 and/or LM 302
may be implemented as a part of a global CAM and/or LM that are shared among multiple
microengines of a processor, where the global CAM and/or LM may be partitioned into
multiple partitions for multiple microengines. Other configurations may exist.
[0038] Figure 5 is a flow diagram illustrating an example of a process for software
automatic controlled caching, according to one embodiment. Exemplary process 500 may be
performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.),
software (such as is run on a dedicated machine), or a combination of both. For example,
process example 500 may be performed by a compiler when compiling source code written
in a variety of programming languages, such as, for example, C/C++ and/or assembly. In one
embodiment, process example 500 includes, but is not limited to, identifying a candidate
representing a plurality of instructions of a plurality of threads that perform one or more
external memory accesses, the external memory accesses having a substantially identical
base address, and inserting one or more directives or instructions into an instruction stream
corresponding to the identified candidate to maintain contents of at least one of a content
addressable memory (CAM) and local memory (LM) of a processor and to modify at least
one of the external memory accesses to access at least one of the CAM and LM of the processor
without having to perform the respective external memory access.

[0039] Referring to Figure 5, at block 501, processing logic receives source code written
in a variety of programming languages, such as, for example, C/C++ and/or assembly. At
block 502, the processing logic parses the source code to identify one or more candidates for
software controlled caching, such as, for example, external memory accesses in each thread
having a substantially identical base address. At block 503, the processing logic inserts one or
more instructions into the instruction stream of the source code to maintain images of a
CAM and/or LM of a microengine of a processor, and to modify the original external
memory accesses of the candidates to access the data images of the LM without having to
access external memory. Other operations may also be performed.
[0040] In one embodiment, after the network application is multi-threaded, either
manually or automatically through a parallelizing compiler, each thread performs essentially
the same operations on a newly received packet, and modifications to global resources (e.g.,
external memory accesses) are protected by critical sections. This transformation
automatically recognizes the candidate (external memory accesses in each thread) for
caching, and implements the software controlled caching (e.g., maintaining the CAM and
LM images, and modifying the original accesses to access the data image in LM).
[0041] In order to provide a candidate for software controlled caching, the candidate has
to be identified out of multiple potential candidates based on an analysis of the
corresponding source code. Figure 6 is an example of a source code, which may be used to
provide candidates for software controlled caching according to one embodiment. Referring
to Figure 6, source code example 600 includes operations that may
be processed via multi-threading processes.

[0042] In one embodiment, source code example 600 may be analyzed to generate a pool
of potential candidates, as shown in Figure 7. Referring to Figure 7, in this example, the
pool of potential candidates 700 includes candidates 701-705. In one embodiment,
candidates 701-705 may be analyzed to identify a closed set of accesses to the external memory
via the same base address. That is, they access addresses (base + offset), where the base is
common and the offset is
a constant. This set is closed in the sense that it contains all accesses in the thread that are in
a dependence relation with each other.

[0043] As a result, as shown in Figure 8 according to one embodiment, any ineligible
candidates may be screened out and one or more eligible candidates may be merged into a
larger candidate based on their identical base address. In one embodiment, a candidate is
ineligible for caching if the base is not common to all the accesses in the candidate.
Alternatively, if the memory locations accessed by a candidate can be accessed by other
programs, the respective candidate is ineligible. In this example, candidate 701 is ineligible,
while candidates 702-705 are eligible candidates. Further, they have the identical base address
(state + 16*i) and thus, candidates 702-705 are merged into a single larger candidate 800
as a final candidate for caching.

[0044] Figure 9 is a flow diagram illustrating an example of a process to identify a
candidate for caching according to one embodiment. The process example 900 may be
performed by a processing logic that may comprise hardware (circuitry, dedicated logic, etc.),
software (such as is run on a dedicated machine), or a combination of both. For example,
process example 900 may be performed by a compiler when compiling source code written
in a variety of programming languages, such as, for example, C/C++ and/or assembly.

[0045] Referring to Figure 9, at block 901, the processing logic partitions substantially all
the external memory accesses into one or more sets of candidates (e.g., closed sets). For
example, if access A depends on access B, they may be grouped into the same partition. That
is, if two accesses are in a dependence relation, they should access the same data image (either
in the external memory or in the local memory). Hence each closed set identified is a
candidate for caching.
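
Partitioning accesses into closed sets amounts to grouping everything that is connected through a dependence relation. One common way to do this is a union-find structure; the sketch below assumes that technique, since the application does not say how the partition is computed. The function name `closed_sets` and the access labels are hypothetical.

```python
def closed_sets(accesses, depends):
    """Group accesses so that any two dependent accesses share a set."""
    parent = {a: a for a in accesses}

    def find(a):
        # Follow parent links to the set representative (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in depends:
        parent[find(a)] = find(b)          # merge the two sets

    groups = {}
    for a in accesses:
        groups.setdefault(find(a), []).append(a)
    return sorted(groups.values())


partition = closed_sets(["A", "B", "C", "D"], [("A", "B"), ("C", "B")])
print(partition)   # [['A', 'B', 'C'], ['D']]
```

Access D has no dependence on the others, so it forms a closed set of its own.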
[0046] At block 902, one or more copy-forward transformations may be optionally
performed on the addresses of each external memory access. For example, the following
operations:

a = b + c; 
d = load[a]; 

may be transformed into the following operation: 

d = load [b + c] 

[0047] At block 903, one or more global value numbering and/or constant folding
operations may be performed for each thread. For example, during a global value numbering
operation, the following operations:

a = 2;
b = c*a;
d = 2;
e = d*c;

may be transformed into the following operations: 
a = 2; 
b = c*a; 
a = 2; 
b = c*a; 
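
The renaming above can be reproduced with a very small value numbering pass. This is a sketch under simplifying assumptions that are not from the application: each variable is assigned only once, statements are `(dest, expr)` pairs, and the operands of commutative operators are sorted into a canonical order (so `c*a` is printed as `a*c`). The function name `value_number` is hypothetical.

```python
def value_number(stmts):
    """stmts: list of (dest, expr); expr is a constant, a name, or (op, lhs, rhs).

    Returns the statements rewritten so that identical values share one name.
    Assumes every variable is assigned at most once.
    """
    rep = {}    # variable name -> representative name of its value
    seen = {}   # canonical expression -> representative name
    out = []
    for dest, expr in stmts:
        if isinstance(expr, tuple):
            op, lhs, rhs = expr
            lhs, rhs = rep.get(lhs, lhs), rep.get(rhs, rhs)
            if op in ("+", "*"):                        # commutative operators
                lhs, rhs = sorted((lhs, rhs), key=str)
            key = (op, lhs, rhs)
        else:
            key = rep.get(expr, expr)
        name = seen.setdefault(key, dest)               # first holder wins
        rep[dest] = name
        out.append((name, key))
    return out


numbered = value_number([
    ("a", 2),
    ("b", ("*", "c", "a")),
    ("d", 2),
    ("e", ("*", "d", "c")),
])
print(numbered)
```

After the pass, `d` is represented by `a` and `e` by `b`, so identical addresses built from them get identical representations.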



[0048] For example, during a constant folding operation, the following operations:

a = 2;
b = c + d;
e = a + b;

may be transformed into the following operations: 
a = 2;
b = c + d;
e = 2 + c + d;
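
Constant folding of this kind can be sketched as a recursive rewrite over expression trees. The representation (nested `(op, lhs, rhs)` tuples) and the function name `fold` are assumptions made for illustration only.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def fold(expr, consts):
    """Substitute known constants and evaluate constant subexpressions."""
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        lhs, rhs = fold(lhs, consts), fold(rhs, consts)
        if isinstance(lhs, int) and isinstance(rhs, int):
            return OPS[op](lhs, rhs)       # both operands known: evaluate
        return (op, lhs, rhs)
    return consts.get(expr, expr)          # name with known value, or leave as is


consts = {"a": 2}
partly = fold(("+", "a", ("+", "c", "d")), consts)
fully = fold(("*", "a", ("+", "a", 3)), consts)
print(partly)   # ('+', 2, ('+', 'c', 'd'))
print(fully)    # 10
```

Names whose values are unknown (`c` and `d` here) are left untouched, so only the constant parts of an address are simplified.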



[0049] At block 904, for each external memory access of each candidate, the address of
the external memory access is converted into a form of (base + offset). According to one
embodiment, if a program has been value numbered in the sense that identical addresses are
made to have identical representations, the effectiveness of the transformation will be improved.
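
As a rough illustration of the (base + offset) form, the sketch below splits an address given as a list of summands into a symbolic base and a constant offset. Treating the address as a flat list of terms, and the name `split_address`, are assumptions for this example; a compiler would work on its own intermediate representation.

```python
def split_address(terms):
    """terms: summands of an address, each a symbolic name (str) or an int.

    Returns (base, offset): the symbolic part in a canonical order and the
    sum of the constant parts.
    """
    offset = sum(t for t in terms if isinstance(t, int))
    base = " + ".join(sorted(t for t in terms if not isinstance(t, int)))
    return base, offset


first = split_address(["state", "16*i", 4])
second = split_address([8, "16*i", "state"])
print(first, second)
```

Because the symbolic terms are sorted, the two accesses above end up with the same base string and differ only in their offsets.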

[0050] At block 905, one or more eligible candidates, such as, for example, candidates
702-705 of Figure 7, are identified, while the ineligible candidates are screened out (e.g.,
eliminated). In one embodiment, if the base address is not identical over substantially all the
external memory accesses in the partition, the corresponding candidate or candidates are
considered as ineligible. Further, if the memory locations of the external memory accesses
may be accessed by other programs, the corresponding candidate or candidates may be
considered as ineligible. That is, if the memory addresses being accessed are known to other
programs, such as, for example, externally defined, they are not suitable for caching. Such
candidate or candidates may be identified via an escape analysis performed by a compiler.

[0051] At block 906, the eligible candidates may be consolidated into a single large
candidate. For example, according to one embodiment, all of the eligible candidates having
the identical base address may be grouped into a single candidate, such as candidate 800 of
Figure 8. At block 907, a candidate having the largest number of candidates grouped is
selected as a final candidate for caching. Other operations may also be performed.
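
The screening, merging and selection steps can be sketched as follows. The data layout (a dictionary from candidate name to a list of `(base, offset)` accesses), the `escaping` set standing in for the result of an escape analysis, and the offsets used in the example are all hypothetical.

```python
from collections import defaultdict


def select_final_candidate(candidates, escaping=()):
    """candidates: {name: [(base, offset), ...]}.

    Screens out candidates whose accesses do not share one base or whose
    memory may be accessed by other programs, groups the rest by base, and
    returns (base, [names]) for the largest group, or None if none is eligible.
    """
    groups = defaultdict(list)
    for name, accesses in candidates.items():
        bases = {base for base, _ in accesses}
        if len(bases) != 1 or name in escaping:
            continue                                   # ineligible candidate
        groups[bases.pop()].append(name)
    if not groups:
        return None
    return max(groups.items(), key=lambda item: len(item[1]))


pool = {
    "701": [("x", 0), ("y", 4)],          # base not common: ineligible
    "702": [("state + 16*i", 0)],
    "703": [("state + 16*i", 4)],
    "704": [("state + 16*i", 8)],
    "705": [("state + 16*i", 12)],
}
final = select_final_candidate(pool)
print(final)
```

With this input the four candidates sharing the base `state + 16*i` are merged and selected, mirroring the 701-705 example in the text.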

[0052] Note that the software controlled caching is used to reduce the latency penalty of
modifications to global resources by multiple threads. However, since the CAM and LM units
are global resources shared by multiple threads, all the caching operations should be
protected by a critical section. According to one embodiment, in order to simplify the
synchronization needed for the critical section, only one candidate (e.g., one global resource)
is selected. However, it will be appreciated that more than one candidate may be selected
dependent upon a particular system design, as long as the critical section is handled
appropriately.

[0053] After the final candidate for caching is identified, one or more instructions are
inserted into the corresponding instruction stream to maintain the CAM and LM units, and to
modify the original accesses in the candidate to access the data image in LM. Figure 10 is a
block diagram illustrating an example of pseudo code in which a caching instruction has been
inserted before each external memory access instruction. In some cases, the inserted caching
instructions may be duplicated and can be consolidated into one caching instruction per group of
related external memory accesses, as shown in Figure 11. Thereafter, the inserted caching
instructions may be expanded to modify the external memory accesses to access the
corresponding local memory instead of accessing external memory, as shown in Figure 12.
In this example, base = state + i*16, m = 0, M = 13 and n is the entry in CAM that contains
the base.

[0054] Consequently, when different threads access the same data, they can directly
access the data image in LM as illustrated in Figure 13 (assuming the starting address B in LM
is 0). In addition, some further optimizations may be performed. For example, in response
to a CAM lookup miss, the expanded caching instructions could write back only the dirty
bytes and load only the bytes actually used. Other operations may also be performed.

[0055] In order to perform appropriate software controlled caching, sufficient local
memory space has to be reserved. Figure 14 is a block diagram illustrating an example of
memory allocations of CAM and LM according to one embodiment. Referring to Figure 14,
the memory configuration example includes a CAM 1401 and a local memory 1402. CAM 1401
includes one or more entries 1403-1404, each of which corresponds to a memory space in the
local memory. For the purposes of illustration, if m is the minimum address byte and M is
the maximum address byte accessed over all accesses in the selected candidate (having n as
the entry of CAM 1401), and N is the number of entries in CAM, then N*(M-m+1) bytes may be
reserved in LM 302 for the data image. Assume the starting address in LM 302 is B. If the
base address of the candidate is stored in the n-th entry of CAM, where n ranges from 0 to
N-1, then the data portion of the associated cache line in LM 302 is from B + n*(M-m+1) to
B + (n+1)*(M-m+1) - 1, as indicated in the corresponding memory space of Figure 14.
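
The address arithmetic of this layout is easy to check numerically. The helper names `lm_reserved_bytes` and `lm_range`, and the values N = 16, m = 0, M = 15 used in the example, are illustrative assumptions and not figures from the application.

```python
def lm_reserved_bytes(N, m, M):
    """Total LM bytes reserved for the data image: N * (M - m + 1)."""
    return N * (M - m + 1)


def lm_range(B, n, m, M):
    """First and last LM byte of the cache line linked to CAM entry n."""
    line = M - m + 1                 # bytes per cache line
    start = B + n * line
    return start, start + line - 1   # B + n*(M-m+1) .. B + (n+1)*(M-m+1) - 1


total = lm_reserved_bytes(N=16, m=0, M=15)
line2 = lm_range(B=0, n=2, m=0, M=15)
print(total, line2)   # 256 (32, 47)
```

Consecutive entries occupy adjacent, non-overlapping ranges, so the last byte of entry N-1 is B + N*(M-m+1) - 1.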

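[0056] Figure 15 is a flow diagram illustrating an example of a process for automatic
software controlled caching according to one embodiment. The process example may be
performed by a processing logic that may comprise hardware (circuitry, dedicated logic,
etc.), software (such as is run on a dedicated machine), or a combination of both.

At block 1504, the caching instructions are expanded into one or more instructions that
look up the base address in the CAM and maintain the corresponding data
image in the LM. Note that the instructions implementing the software controlled caching,
including the expanded caching instructions and the modified accesses, should be
protected by a critical section. Other operations may also be performed.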
[0059] Figure 16 is a flow diagram illustrating an example of a process for maintaining
images of a CAM and/or LM within a microengine, according to one embodiment. Process
example 1600 may be performed by a processing logic that may comprise hardware
(circuitry, dedicated logic, etc.), software (such as is run on a dedicated machine), or a
combination of both. For example, process example 1600 may be performed by a compiler
when compiling source code written in a variety of programming languages, such as, for
example, C/C++ and/or assembly. The process example 1600 may be performed as a part of
operations performed at block 1504 of Figure 15.

[0060] Referring to Figure 16, when the processing logic receives a request for accessing
an external memory location, at block 1601, the processing logic looks up in the CAM to
determine whether the CAM contains a base address of the requested external memory
access. If the CAM contains the base address of the requested external memory access (e.g.,
a hit), at block 1606, an index of the corresponding entry of the CAM that contains the base
address is retrieved, and at block 1607, an entry of the LM is accessed based on the retrieved
index without having to access the external memory.

[0061] If the CAM does not contain the base address of the requested external memory
access (e.g., a miss), at block 1602, a least recently used (LRU) entry of the CAM is allocated
or identified (e.g., an entry used for a previous caching operation of a previous external
memory access), and the index and the address stored therein are retrieved. At block 1603, the
retrieved address stored in the LRU entry of the CAM is examined to determine whether the
address is valid.

[0062] If the address is determined to be valid, at block 1604, the data stored in the
corresponding local memory (e.g., the previously cached data for a previous external memory
access) is written back (e.g., swapped) into the external memory based on the identified valid
address. Thus, the LRU entry of the CAM and the corresponding LM space are now
available for caching the current external memory access. At block 1605, the data of the
current memory access is loaded into the LRU entry of the CAM and the corresponding LM
space from the external memory location identified by the base address. At block 1607, the
data stored in the local memory is returned in response to the request. Figure 17 is a block
diagram of an example of pseudo code for the process example 1600 of Figure 16. Other
operations may also be performed.
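
The hit, miss, write-back and load steps of this process can be put together in a compact software model. This is a sketch only, not the pseudo code of Figure 17: the class `SoftwareCache`, its methods, and the use of a dictionary as external memory are hypothetical, and the model always writes back an evicted line rather than tracking dirty bytes.

```python
from collections import OrderedDict


class SoftwareCache:
    """Toy model of a CAM (tags in LRU order) plus an LM data image."""

    def __init__(self, num_entries, external):
        self.external = external                     # dict: base address -> data
        # First key is the LRU entry; a tag of None marks an invalid address.
        self.cam = OrderedDict((n, None) for n in range(num_entries))
        self.lm = [None] * num_entries               # data image, one slot per entry
        self.external_accesses = 0

    def _entry_for(self, base):
        entry = next((n for n, tag in self.cam.items() if tag == base), None)
        if entry is None:                            # miss
            entry = next(iter(self.cam))             # take the LRU entry
            old = self.cam[entry]
            if old is not None:                      # valid address: write back
                self.external[old] = self.lm[entry]
                self.external_accesses += 1
            self.lm[entry] = self.external[base]     # load the current data
            self.external_accesses += 1
            self.cam[entry] = base
        self.cam.move_to_end(entry)                  # most recently used
        return entry

    def read(self, base):
        return self.lm[self._entry_for(base)]

    def write(self, base, value):
        self.lm[self._entry_for(base)] = value


external = {0x10: 1, 0x20: 2, 0x30: 3}
cache = SoftwareCache(2, external)
cache.write(0x10, cache.read(0x10) + 1)   # one load, then a hit in LM
cache.read(0x20)                          # second entry is filled
cache.read(0x30)                          # evicts 0x10 and writes it back
print(external[0x10], cache.external_accesses)
```

The modified value for address 0x10 reaches external memory only when its entry is evicted, which is the point at which the write-back step applies.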

[0063] Figure 18 is a block diagram of an example of a processor having multiple
microengines according to one embodiment. The processor example 1800 includes multiple
microengines 1801-1802. Each of the microengines 1801-1802 includes CAMs 1803-1804
respectively, which are managed by LRU logic 1807-1808 respectively. Each microengine
further includes LMs 1805-1806 respectively, which are managed by index logic 1809-1810
respectively. The microengines 1801-1802 may be used to perform automatic software
controlled caching operations described above, where each of the microengines may perform
such operations for a respective thread substantially concurrently. It will be appreciated that
some well-known components of a processor are not shown in order not to obscure
embodiments of the invention in unnecessary detail.

[0064] Figure 19 is a block diagram illustrating an example of a data processing system
according to one embodiment. The exemplary system 1900 may be used to perform the
process examples for automatic software controlled caching described above. Note that
while Figure 19 illustrates various components of a computer system, it is not intended to
represent any particular architecture or manner of interconnecting the components, as such
details are not germane to the present invention. It will also be appreciated that network
computers, handheld computers, cell phones, and other data processing systems, which have
fewer components or perhaps more components, may also be used with the present invention.
The computer system of Figure 19 may, for example, be an Apple Macintosh computer.

[0065] Referring to Figure 19, the computer system 1900 includes, but is not limited to, a
processor 1902 that processes data signals. Processor 1902 may be an exemplary processor
1800 illustrated in Figure 18. The processor 1902 may be a complex instruction set computer
(CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a
very long instruction word (VLIW) microprocessor, a processor implementing a combination
of instruction sets, or other processor device, such as a digital signal processor, for example.
Figure 19 shows an example of an embodiment of the present invention implemented as a
single processor system 1900. However, it is understood that embodiments of the present
invention may alternatively be implemented as systems having multiple processors.
Processor 1902 may be coupled to a processor bus 1910 that transmits data signals between
processor 1902 and other components in the system 1900.

[0066] In one embodiment, processor 1902 includes, but is not limited to, multiple
microengines. The microengines may be used to perform automatic
software controlled caching for multiple threads substantially concurrently.

[0067] In addition, system 1900 includes a memory 1916. Memory 1916 may be a
dynamic random access memory (DRAM) device, a static random access memory (SRAM)
device, or other memory device. Memory 1916 may store instructions and/or data
represented by data signals that may be executed by processor 1902. The instructions and/or
data may include code for performing any and/or all of the techniques of the present
invention. A compiler for compiling source code, including identifying one or more
candidates for software controlled caching and inserting and expanding caching
instructions to access the local memory rather than the external memory, may also reside in
memory 1916. Memory 1916 may also contain additional software and/or data not shown. A
cache memory 1904 may reside inside or outside the processor 1902 and stores data signals
stored in memory 1916. Cache memory 1904 in this embodiment speeds up memory accesses
by the processor by taking advantage of its locality of access.

[0068] Further, a bridge/memory controller 1914 may be coupled to the processor bus
1910 and memory 1916. The bridge/memory controller 1914 directs data signals between
processor 1902, memory 1916, and other components in the system 1900 and bridges the
data signals between processor bus 1910, memory 1916, and a first input/output (I/O) bus
1920. In some embodiments, the bridge/memory controller provides a graphics port for
coupling to a graphics controller 1912. In this embodiment, graphics controller 1912
interfaces to a display device for displaying images rendered or otherwise processed by the
graphics controller 1912 to a user. The display device may include a television set, a
computer monitor, a flat panel display, or other suitable display devices.

[0069] First I/O bus 1920 may include a single bus or a combination of multiple buses.
First I/O bus 1920 provides communication links between components in system 1900. A
network controller 1922 may be coupled to the first I/O bus 1920. The network controller
links system 1900 to a network that may include a plurality of processing systems and
supports communication among various systems. The network of processing systems may
include a local area network (LAN), a wide area network (WAN), the Internet, or other
network. A compiler for compiling source code can be transferred from one computer
to another system through a network. Similarly, compiled code that includes the directives or
instructions inserted by the compiler can be transferred from a host machine (e.g., a
development machine) to a target machine (e.g., an execution machine).
[0070] In some embodiments, a display device controller 1924 may be coupled to the
first I/O bus 1920. The display device controller 1924 allows coupling of a display device to
system 1900 and acts as an interface between a display device and the system. The display
device may comprise a television set, a computer monitor, a flat panel display, or other
suitable display device. The display device receives data signals from processor 1902
through display device controller 1924 and displays information contained in the data signals
to a user of system 1900.
[0071] A second I/O bus 1930 may comprise a single bus or a combination of multiple
buses. The second I/O bus 1930 provides communication links between components in
system 1900. A data storage device 1932 may include a hard disk drive, a floppy disk drive,
a CD-ROM device, a flash memory device, or other mass storage devices. Data storage device
1932 may include one or a plurality of the described data storage devices.

[0072] A user input interface 1934 may be coupled to the second I/O bus 1930, such as,
for example, a keyboard or a pointing device (e.g., a mouse). The user input interface 1934
may include a keyboard controller or other keyboard interface device. The user input
interface 1934 may include a dedicated device or may reside in another device such as a bus
controller or other controller device. The user input interface 1934 allows coupling of a user
input device (e.g., a keyboard, a mouse, joystick, or trackball, etc.) to system 1900 and
transmits data signals from a user input device to system 1900.

[0073] One or more I/O controllers 1938 may be used to connect one or more I/O devices
to the exemplary system 1900. For example, the I/O controller 1938 may include a USB
(universal serial bus) adapter for controlling USB peripherals or, alternatively, an IEEE 1394
(also referred to as Firewire) bus controller for controlling IEEE 1394 compatible devices.

[0074] Furthermore, the elements of system 1900 perform their conventional functions
well-known in the art. In particular, data storage device 1932 may be used to provide
long-term storage for the executable instructions and data structures for embodiments of
methods of dynamic loop aggregation in accordance with embodiments of the present
invention, whereas memory 1916 is used to store on a shorter term basis the executable
instructions of embodiments of the methods of dynamic loop aggregation in accordance with
embodiments of the present invention during execution by processor 1902.

[0075] Although the above example describes the distribution of computer code via a
data storage device, program code may be distributed by way of other computer readable
mediums. For instance, a computer program may be distributed through a computer readable
medium such as a floppy disk, a CD ROM, a carrier wave, a network, or even a transmission
over the Internet. Software code compilers often use optimizations during the code
compilation process in an attempt to generate faster and better code.
[0076] Thus, automatic software controlled caching generations in network applications
have been described herein. In the foregoing specification, the invention has been described
with reference to specific exemplary embodiments thereof. It will be evident that various
modifications may be made thereto without departing from the broader spirit and scope of
the invention as set forth in the following claims. The specification and drawings are,
accordingly, to be regarded in an illustrative sense rather than a restrictive sense.






PCT/CN2004/000538



CLAIMS



What is claimed is: 



1. A method, comprising:
   identifying a candidate representing a plurality of instructions of a plurality of threads
      that perform one or more external memory accesses, the external memory
      accesses having a substantially identical base address; and
   inserting at least one of directives and instructions into an instruction stream
      corresponding to the identified candidate to maintain contents of at least one
      of a content addressable memory (CAM) and local memory (LM) of a
      processor and to modify at least one of the external memory accesses to access
      at least one of the CAM and LM of the processor without having to perform
      the respective external memory access.



2. The method of claim 1, further comprising:
   partitioning the plurality of instructions of the external memory accesses into one or
      more sets of potential candidates based on dependency relationships of the
      instructions; and
   selecting one of the potential candidate sets as the candidate, instructions of the
      candidate satisfying a predetermined dependency relationship.

3. The method of claim 2, further comprising converting addresses of each of the external
   memory accesses into a form having a base address and an offset.

4. The method of claim 3, wherein the base address is a non-constant part and the offset is
   a constant part of the converted address.

5. The method of claim 3, further comprising screening out one or more ineligible
   candidates from the potential candidates, wherein the ineligible candidates include a base
   address that is different than a remainder of the potential candidates.

6. The method of claim 3, further comprising grouping multiple potential candidates
   having a substantially identical base address into a single candidate, wherein the group
   having the most of the potential candidates is selected as a final candidate for caching.
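
The address conversion and grouping recited in claims 3 to 6 can be pictured with a small sketch. It is only an illustration under stated assumptions, not the claimed compiler: addresses are written as hypothetical strings of the form `symbol + constant`, and the names `split_address` and `pick_final_candidate` are invented for this example.

```python
from collections import defaultdict


def split_address(expr):
    """Split 'symbol + constant' into (base, offset).

    The base is the non-constant part and the offset is the constant part.
    The string format is an assumption made for this sketch.
    """
    base, _, const = expr.replace(" ", "").partition("+")
    return base, int(const) if const else 0


def pick_final_candidate(accesses):
    """Group accesses by base address and return the largest group."""
    groups = defaultdict(list)
    for name, expr in accesses:
        base, offset = split_address(expr)
        groups[base].append((name, offset))
    # The group holding the most potential candidates becomes the final candidate.
    best_base = max(groups, key=lambda b: len(groups[b]))
    return best_base, groups[best_base]


# Hypothetical external memory accesses collected from several threads.
accesses = [
    ("ld1", "flow_ptr + 0"),
    ("ld2", "flow_ptr + 4"),
    ("st1", "flow_ptr + 4"),
    ("ld3", "stats_ptr + 8"),
]
base, members = pick_final_candidate(accesses)
```

Here the three accesses through `flow_ptr` share a base address and form the largest group, while the lone `stats_ptr` access is left uncached.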

7. The method of claim 1, wherein identifying the candidate further comprises:
   performing a copy-forward transformation on addresses of each of the external
      memory accesses; and
   performing at least one of a global value numbering operation and a constant folding
      operation for each thread.

8. The method of claim 3, further comprising:
   for each thread, reserving a sufficient space in the local memory to store data portions
      of cache lines; and
   inserting a caching instruction prior to each of the external memory accesses.
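
As a rough illustration of inserting a caching instruction ahead of each external memory access, the sketch below rewrites a toy instruction stream. The tuple encoding and the opcode names (`ext_load`, `ext_store`, `cache_lookup`, `lm_load`, `lm_store`) are hypothetical and chosen for this example only; a real compiler would operate on its own intermediate representation.

```python
def insert_caching(stream, candidate_base):
    """Return a new stream with a lookup inserted before each candidate access.

    Each instruction is a tuple (opcode, register, base, offset). Accesses whose
    base matches the candidate are redirected to local memory; everything else
    is copied unchanged.
    """
    out = []
    for ins in stream:
        op = ins[0]
        if op in ("ext_load", "ext_store") and ins[2] == candidate_base:
            # Caching instruction placed prior to the external memory access.
            out.append(("cache_lookup", candidate_base))
            # The access itself now targets local memory instead.
            out.append(("lm_" + op[4:],) + ins[1:])
        else:
            out.append(ins)
    return out


stream = [
    ("ext_load", "r1", "flow_ptr", 0),
    ("add", "r1", "r1", 1),
    ("ext_store", "r1", "flow_ptr", 0),
    ("ext_load", "r2", "stats_ptr", 8),
]
rewritten = insert_caching(stream, "flow_ptr")
```

Only the accesses that belong to the selected candidate are rewritten; the access through `stats_ptr` still goes to external memory.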



9. The method of claim 8, further comprising seeking the base address of each external
   memory access in the CAM to determine whether the CAM includes an entry that
   contains the base address being sought.

10. The method of claim 9, wherein if the CAM includes an entry containing the base
    address being sought, the method further comprises:
    determining an offset of the local memory based on the entry of the CAM containing
       the base address being sought; and
    accessing data from an entry of the local memory referenced by the determined
       offset.

11. The method of claim 9, wherein if the CAM does not include an entry containing the
    base address being sought, the method further comprises allocating a least recently
    used (LRU) entry of the CAM having a base address of a previous external memory
    access.



12. The method of claim 11, further comprising:
    loading data of a current external memory access from the external memory into an
       entry of the local memory referenced by the allocated LRU entry; and
    storing the base address of the current external memory access in the LRU entry of
       the CAM, replacing the base address of the previous external memory access.

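13. The method of claim 11, further comprising:
    examining the base address of the previous external memory access in the allocated
       LRU entry to determine whether the base address is valid; and
    replicating data of an entry in the local memory corresponding to the allocated LRU
       entry to a location of the external memory based on the base address of the previous
       external memory access.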
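
The lookup, allocation, load, and write-back steps of claims 9 to 13 can be modelled in a few lines. This is a simplified software model under stated assumptions, not the hardware or the generated code: the class name `SoftwareCache` and its fields are hypothetical, external memory is a plain dictionary keyed by base address, and every valid victim entry is written back because the model keeps no dirty bit.

```python
class SoftwareCache:
    """Toy model of a CAM holding base addresses plus a local memory (LM)."""

    def __init__(self, external, num_entries=2):
        self.external = external              # base address -> list of words
        self.cam = [None] * num_entries       # tags; None marks an invalid entry
        self.lm = [None] * num_entries        # cached lines, indexed like the CAM
        self.lru = list(range(num_entries))   # least recently used entry first
        self.external_accesses = 0

    def _entry_for(self, base):
        if base in self.cam:                  # CAM hit: reuse the linked LM entry
            idx = self.cam.index(base)
        else:                                 # CAM miss: allocate the LRU entry
            idx = self.lru[0]
            old_base = self.cam[idx]
            if old_base is not None:          # valid previous base: write it back
                self.external[old_base] = list(self.lm[idx])
                self.external_accesses += 1
            self.lm[idx] = list(self.external[base])   # load the current line
            self.external_accesses += 1
            self.cam[idx] = base              # replace the previous base address
        self.lru.remove(idx)                  # mark entry as most recently used
        self.lru.append(idx)
        return idx

    def load(self, base, offset):
        return self.lm[self._entry_for(base)][offset]

    def store(self, base, offset, value):
        self.lm[self._entry_for(base)][offset] = value


external = {"A": [1, 2], "B": [3, 4], "C": [5, 6]}
cache = SoftwareCache(external, num_entries=2)
first = cache.load("A", 0)      # miss: line A is loaded from external memory
cache.store("A", 1, 20)         # hit: only the local memory is updated
second = cache.load("B", 0)     # miss: fills the second entry
third = cache.load("C", 1)      # miss: evicts A, writing its line back first
```

In this run the store to `A` never touches external memory directly; the modified line reaches external memory only when the entry for `A` is reallocated for `C`.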

14. A machine-readable medium having executable code to cause a machine to perform a
    method, the method comprising:
    identifying a candidate representing a plurality of instructions of a plurality of threads
       that perform one or more external memory accesses, the external memory
       accesses having a substantially identical base address; and
    inserting at least one of directives and instructions into an instruction stream
       corresponding to the identified candidate to maintain contents of at least one
       of a content addressable memory (CAM) and local memory (LM) of a
       processor and to modify at least one of the external memory accesses to access
       at least one of the CAM and LM of the processor without having to perform
       the respective external memory access.

15. The machine-readable medium of claim 14, wherein the method further comprises:
    partitioning the plurality of instructions of the external memory accesses into one or
       more sets of potential candidates based on dependency relationships of the
       instructions; and
    selecting one of the potential candidate sets as the candidate, instructions of the
       candidate satisfying a predetermined dependency relationship.

16. The machine-readable medium of claim 15, wherein the method further comprises
    converting addresses of each of the external memory accesses into a form having a base
    address and an offset.

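17. The machine-readable medium of claim 15, wherein the method further comprises
    screening out one or more ineligible candidates from the potential candidates, wherein
    the ineligible candidates include a base address that is different than a remainder of the
    potential candidates.

18. The machine-readable medium of claim 15, wherein the method further comprises
    grouping multiple potential candidates having a substantially identical base address into
    a single candidate, wherein the group having the most of the potential candidates is
    selected as a final candidate for caching.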

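19. The machine-readable medium of claim [...], wherein the method further comprises:
    for each thread, reserving a sufficient space in the local memory to store data portions
       of cache lines; and
    inserting a caching instruction prior to each of the external memory accesses.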

20. The machine-readable medium of claim 19, wherein the method further comprises
    seeking the base address of each external memory access in the CAM to determine
    whether the CAM includes an entry that contains the base address being sought.

21. The machine-readable medium of claim 20, wherein if the CAM includes an entry
    containing the base address being sought, the method further comprises:
    determining an offset of the local memory based on the entry of the CAM containing
       the base address being sought; and
    accessing data from an entry of the local memory referenced by the determined
       offset.

22. The machine-readable medium of claim 20, wherein if the CAM does not include an
    entry containing the base address being sought, the method further comprises
    allocating a least recently used (LRU) entry of the CAM having a base address of a
    previous external memory access.

23. The machine-readable medium of claim 22, wherein the method further comprises:
    loading data of a current external memory access from the external memory into an
       entry of the local memory referenced by the allocated LRU entry; and
    storing the base address of the current external memory access in the LRU entry of
       the CAM, replacing the base address of the previous external memory access.

24. The machine-readable medium of claim 22, wherein the method further comprises:
    examining the base address of the previous external memory access in the allocated
       LRU entry to determine whether the base address is valid; and
    replicating data of an entry in the local memory corresponding to the allocated LRU
       entry to a location of the external memory based on the base address of the previous
       external memory access.



25. A processor, comprising:
    a plurality of microengines, each having a content addressable memory (CAM) and a
       local memory respectively to perform a plurality of threads substantially
       concurrently, each of the plurality of threads including one or more
       instructions performing at least one external memory access based on a base
       address that is substantially identical,
    wherein the base address is examined in the CAM to determine whether the CAM
       includes an entry containing the base address, and
    wherein an entry of the local memory corresponding to the entry of the CAM is
       accessed without having to access the external memory, if the CAM
       includes the entry containing the base address.

26. The processor of claim 25, wherein the CAM of the microengines comprises
    a least recently used (LRU) logic to allocate an LRU entry of the CAM linking with an
    entry of the local memory, wherein the allocated LRU entry is used to cache the
    external memory access for subsequent accesses to an identical location of the external
    memory.

The data processing system of claim 26, wherein the LM comprises an indexing logic
to provide an index pointing to an entry of the LM based on a reference supplied by the
LRU logic.

A data processing system, comprising:
   a processor;
   a memory coupled to the processor; and
   a program instruction that, when executed from the memory, causes the processor to
      identify a candidate representing a plurality of instructions of a plurality of
         threads that perform one or more external memory accesses, the
         external memory accesses having a substantially identical base
         address, and
      insert at least one of directives and instructions into an instruction stream
         corresponding to the identified candidate to maintain contents of at
         least one of a content addressable memory (CAM) and local memory
         (LM) of an executing processor executing the plurality of threads and
         to modify at least one of the external memory accesses to access at least
         one of the CAM and LM of the executing processor without having to
         perform the respective external memory access.


The data processing system of claim 27, wherein the plurality of threads is executed by
a plurality of microengines of the executing processor respectively, and wherein each
of the microengines of the executing processor includes a CAM and a LM.



The data processing system of claim 28, wherein the CAM of the microengines
comprises a least recently used (LRU) logic to allocate an LRU entry of the CAM to
cache a current external memory access for subsequent identical external memory
access, if the CAM does not contain the base address of the current external memory
access, and wherein the LM comprises an indexing logic to provide an index pointing
to an entry of the LM based on a reference supplied by the LRU logic.

ABSTRACT OF THE DISCLOSURE

Automatic software controlled caching generations in network applications are
described herein. In one embodiment, a candidate representing a plurality of instructions of
a plurality of threads that perform one or more external memory accesses is identified, where
the external memory accesses have a substantially identical base address. One or more
directives and/or instructions are inserted into an instruction stream corresponding to the
identified candidate to maintain contents of at least one of a content addressable memory
(CAM) and local memory (LM) of a processor, and to modify at least one of the external
memory accesses to access at least one of the CAM and LM of the processor without having to
perform the respective external memory access. Other methods and apparatuses are also
described.